Method and apparatus for rejection of speech recognition results in accordance with confidence level

ABSTRACT

An automatic speech recognition system for continuous speech recognition of vocabulary words for an auto-attendant system providing hands-free telephone calling. The system utilizes a vocabulary comprising numbers or names of people to be called, and uses known techniques for automatic speech recognition to build models of word sequencing, resulting in high confidence levels of recognition.

BACKGROUND OF THE INVENTION

[0001] The present invention relates to automatic speech recognition, and more particularly to an efficient system and method for continuous speech recognition of vocabulary words. A source for examples of the prior art and prior art mathematical techniques is Delaney, D. W., “Voice User Interface for Wireless Internetworking,” Qualifying Examination Report, Georgia Institute of Technology, School of Electrical and Computer Engineering, Atlanta, Ga., Jan. 30, 2001.

[0002] Automatic speech recognition is an important element of wireless connectivity. Pocket-sized devices having small screens and no keyboard will be enabled by speech technology to allow users to interact with systems in a natural manner. Similarly, automatic speech recognition is necessary for an auto-attendant system providing hands-free telephone calling in which a user requests that a telephone number be dialed for a person whose name is spoken by the user. While this application is but one of many for the present invention, it invokes many issues to be addressed in automatic speech recognition. The automatic speech recognition unit must include a vocabulary. In the present example, the vocabulary comprises names of people to be called. Known techniques for automatic speech recognition create stochastic models of word sequencing using training data. Then P(O|W) is estimated. This is the probability that a particular set of acoustic observations, O, corresponds to a model of a word W.

[0003] An important technique for deriving correlation of particular spoken sounds to models is the Hidden Markov Model. The Hidden Markov Model is provided to operate on outputs from audio circuitry which captures a sample of N frames for a given sound. A language is resolved into phonemes, which are the abstract units of a phonetic system of a language that correspond to a set of similar speech sounds which are perceived to be a distinctive sound in the language. The apparatus detects phones from the samples of N frames. A phone is the acoustic manifestation of one or more linguistically-based phonemes or phoneme-like items. Each known word includes one or more phones.

[0004] Qualitatively, the decoder may be viewed as comparing one or more recognition models to features associated with an unknown utterance. The unknown utterance is recognized by the known words associated with the recognition model with which the test pattern most closely matches. Recognition model parameters are estimated from static training data stored during an initial training period.

[0005] The Hidden Markov Model (HMM) can best be described as a probabilistic state machine for the study of time series. In speech recognition, the time series is given by an observation vector O. The observation vector O=(O₁, O₂, . . . , O_(T)), where each O_(i) is an acoustically meaningful vector of speech data for the “i”th frame. HMMs are Markov chains whose state sequence is hidden by the output probabilities of each state. An HMM with N states is indexed as {s₁, s₂, . . . , s_(N)}. A state s_(k) contains an output probability distribution B which describes the probability that a particular observation is produced by that state. B can be either discrete or continuous. The HMM has an initial state distribution, π, which describes the probability of starting in any one of the N states. For convenience in notation, the entire HMM can be written as λ=(A, B, π), where A denotes the state transition probabilities. Speech recognition is primarily interested in the probability P(O|λ). The results of such decoding are not certain. The result could be a response to out-of-vocabulary words (OOVs) or another misrecognition. Such a misrecognition will generate the wrong telephone call. Practical systems must try to detect a speech recognition error and reject a speech recognition result when the result is not reliable. Prior systems have derived rejection information from acoustic model level data, language model level data and parser level data. Such data requires a good deal of processing power, which increases the expense of practical implementations and adds difficulty in achieving real time operation.
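By way of illustration only, the probability of an observation sequence O given an HMM may be computed with the well-known forward algorithm. The following is a minimal sketch for a discrete-output HMM. The function name `forward_probability` and the list-based representation of π, A (assumed here to be the state transition matrix) and B are illustrative assumptions and are not taken from the decoder described herein.

```python
from typing import Sequence


def forward_probability(
    pi: Sequence[float],
    A: Sequence[Sequence[float]],
    B: Sequence[Sequence[float]],
    observations: Sequence[int],
) -> float:
    """Return P(O | HMM) for a discrete-output HMM (forward algorithm).

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j (assumed meaning)
    B[i][k] : probability that state i emits observation symbol k
    observations : non-empty sequence of symbol indices, one per frame
    """
    n_states = len(pi)
    # Initialization: start in state i and emit the first observation.
    alpha = [pi[i] * B[i][observations[0]] for i in range(n_states)]
    # Induction: move through the transition matrix, then emit the next symbol.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][obs]
            for j in range(n_states)
        ]
    # Termination: sum over every state the model may occupy at the last frame.
    return sum(alpha)
```

In practice, decoders generally work with log probabilities to avoid numerical underflow over long utterances; the sketch above omits this for clarity.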

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The invention is illustrated by way of example, and not limitation, in the figures. Like reference numerals denote like elements in the various figures.

[0007] FIG. 1 is a block diagrammatic representation of a system incorporating the present invention.

[0008] FIG. 2 is a block diagrammatic representation of the method of the present invention which also serves to illustrate the computer program product of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[0009] FIG. 1 is a block diagrammatic representation of a system comprising the present invention. In the example of FIG. 1, a user 1 wishes to make a hands-free call, and speaks into a microphone 3 via sound waves 2. The microphone 3 may be included in a hands-free telephone or other device. The microphone 3 converts the voice of the user 1 into electrical signals processed by an audio circuit 6.

[0010] Speech signals from the audio circuit 6 are indicative of the sounds voiced by the user 1, and are decoded by a decoder 9. Many forms of a decoder 9 are used in the art. As described above, a decoder 9 utilizes a model after a training operation has been performed. The audio circuit 6 and the decoder 9 may be provided, for hands-free telephone applications, in the environment of voice processing circuit boards. One example of a voice processing circuit board is the model D/41EPCI4-PCI voice processor made by Dialogic Corporation. This circuit board is a 4 port analog voice processing board incorporating an architecture known under the trademark Signal Computing System Architecture™ (SCSA™). Particular inputs provided and outputs generated in accordance with the present invention are further discussed below. The particular circuit board cited above allows interaction of the basic circuit software and hardware with additional boards. Consequently, a prior art circuit may be straightforwardly modified in accordance with the present invention.

[0011] The present system is directed toward recognizing names. Full sentences will not be resolved from the audio input. Consequently, no language model will be used during decoding. Similarly, no parser is used after decoding. Therefore, language models and parser information will not be used to provide information in confidence scoring. What is being provided from the decoder 9 is output indicative of acoustic level information, namely phone and word level features. Processing is, therefore, simplified.

[0012] The basic phone feature is p(x|u), which is the probability that information based on a number of speech frames x corresponds to a particular model u of a given phone. In a preferred form, the decoder also generates the basic level word score p(x|w), which is the likelihood that a certain group of frames x represents a word w. Individual phonemes do not work well in practice as units on which to base recognition because there is too much variation due to articulatory effect from adjacent sounds. Since only acoustic level information is being obtained, further terms must be generated on which an approximation of p(x|u) can be based. A “standard” feature set used for many speech recognizers is a 39 dimensional feature vector consisting of 12 mel frequency cepstral coefficients, normalized log energy, and Δ and ΔΔ coefficients. The cepstrum separates vocal tract parameters from their excitation signal. Lower order coefficients in the cepstrum represent slowly varying vocal tract parameters, and the remaining coefficients model the quickly varying excitation signal and pitch. Speech recognition is generally only interested in the first 13 cepstral coefficients. Lower order coefficients are often omitted in favor of the energy measures of the higher orders. In the present embodiment, the twelfth order coefficients are included in the calculations of the decoder 9. Mel-frequency cepstral coefficients approximate the critical band of the human auditory system by warping the frequency axis prior to linear transform.
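As an illustration of how Δ and ΔΔ coefficients may be appended to per-frame cepstral coefficients, the following minimal sketch stacks 12 static coefficients with their first and second differences to give a 36 dimensional vector per frame. The simple central difference with repeated edge frames is an assumption made for brevity; practical front ends commonly use a regression over a wider window. The function names `deltas` and `stack_features` are hypothetical.

```python
from typing import List


def deltas(frames: List[List[float]]) -> List[List[float]]:
    """Central-difference deltas; the first and last frames are repeated at the edges."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev_frame, next_frame)])
    return out


def stack_features(mfcc: List[List[float]]) -> List[List[float]]:
    """Concatenate static, delta and delta-delta coefficients for each frame."""
    d = deltas(mfcc)   # first-order differences (delta)
    dd = deltas(d)     # differences of the deltas (delta-delta)
    return [static + a + b for static, a, b in zip(mfcc, d, dd)]
```

With 12 static coefficients per frame, each stacked vector has 36 entries; adding a normalized log energy term and its differences would give the 39 dimensional “standard” set.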

[0013] The front end of a speech recognizer represents only a portion of the computational power needed to obtain results with a known confidence level. Therefore, in accordance with the present invention, further measures are generated which may be efficiently utilized for deriving further features at the word level. These features may be processed in a computationally efficient manner. These features are generated by a vector generator 12 which receives input from the decoder 9. These inputs are described in greater detail with respect to FIG. 2.

[0014] The vectors generated by the vector generator 12 are provided to a classifier 15, which is a confidence measure feature vector classifier. The classifier 15 may be an artificial neural network (ANN). Alternatively, the classifier 15 could comprise a Linear Discriminant Analysis (LDA) linear classifier. It has been found that the ANN is preferred.
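The manner in which a small neural network may map a confidence feature vector to a single value can be sketched as follows. This is an illustrative forward pass only, assuming one hidden layer with sigmoid activations; the network topology, the weights and the function name `ann_confidence` are hypothetical and are not specified herein. Training of the weights is not shown.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Logistic function mapping any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def ann_confidence(
    x: Sequence[float],
    hidden_w: Sequence[Sequence[float]],
    hidden_b: Sequence[float],
    out_w: Sequence[float],
    out_b: float,
) -> float:
    """Forward pass of a one-hidden-layer network giving a scalar confidence."""
    # Each hidden unit takes a weighted sum of the feature vector plus a bias.
    hidden = [
        sigmoid(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(hidden_w, hidden_b)
    ]
    # The single output unit reduces the hidden activations to one dimension.
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)
```

A linear classifier such as one obtained by LDA would correspond to replacing the hidden layer with a single weighted sum of the feature vector.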

[0015] In each type of classifier 15, confidence feature vectors are reduced to one dimension. This one dimensional vector is provided as an output of the classifier 15 to a threshold circuit 17. An accept/reject threshold T is defined and embodied in the threshold circuit 17. If the output of the classifier 15 is equal to or greater than T, an accept signal is provided by the threshold circuit 17. If the output of the classifier 15 is less than T, then a reject signal is provided by the threshold circuit 17.
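The accept/reject comparison described above may be expressed in a few lines. The function name `threshold_decision` and the string return values are illustrative assumptions.

```python
def threshold_decision(confidence: float, threshold: float) -> str:
    """Return "accept" when the classifier output is at least T, else "reject"."""
    # Equality counts as acceptance, matching "equal to or greater than T".
    return "accept" if confidence >= threshold else "reject"
```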

[0016] For name dialing, a false alarm is more troublesome than a false accept. Therefore, the system is optimized to reject as many misrecognized and out-of-vocabulary words as possible while keeping the false alarm rate at a low level. A working level is set at a false alarm rate of less than 5%. To further reduce false alarms, in a preferred form, the system is configured to provide a prompt from a false alarm circuit 18 providing an output to a speaker 20 that provides aural information to the user 1. On a first rejection of a word, the false alarm circuit 18 is triggered to provide an output to the user 1, such as by the speaker 20, prompting the user 1 to repeat the name corresponding to the keyword associated with the number to be called. The user then says the name again. If a reject signal is again provided by the threshold circuit 17, the user is instructed to use manual service rather than automatic dialing to reach a particular party.
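The two-attempt interaction described above can be sketched as a short control loop. The callables `recognize` and `prompt`, the prompt wording and the function name `dial_by_name` are hypothetical stand-ins for the decoder, classifier and speaker path; they are assumptions for illustration.

```python
from typing import Callable, Optional, Tuple


def dial_by_name(
    recognize: Callable[[], Tuple[str, float]],
    threshold: float,
    prompt: Callable[[str], None],
) -> Optional[str]:
    """Return the accepted name, or None if both attempts are rejected.

    recognize() is assumed to return (first choice name, confidence value).
    """
    for attempt in range(2):
        name, confidence = recognize()
        if confidence >= threshold:
            return name  # accept signal: the call may be placed
        if attempt == 0:
            # First rejection: ask the user to say the name again.
            prompt("Please repeat the name.")
    # Second rejection: refer the user to manual service.
    prompt("Please use manual service.")
    return None
```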

[0017] In one exemplification, an auto-attendant system was provided having a vocabulary of hundreds of words. A large vocabulary continuous speech recognition (LVCSR) system from Intel, modified as disclosed above, was provided. Speech feature vectors included 12 MFCC, 12 Δ and 12 ΔΔ. Cepstral Mean Subtraction (CMS), Variance Normalization (VN) and Vocal Tract Normalization (VTN) were not used. Confidence measure vectors for training, cross-validation and testing data were obtained over a six month period. There were 11,147 utterances for training, 1,898 utterances for cross-validation and 998 utterances for testing. The working point was set at a false alarm rate of less than or equal to 5%. Results were as follows:

Data Sets    False Alarm    Correct Reject
Training     0.0449         0.5114
Develop      0.0355         0.5096
Test         0.0443         0.5295
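The manner in which a working point of this kind may be chosen can be sketched as follows. The sketch assumes that a false alarm means rejection of a correctly recognized utterance, and that a correct reject means rejection of a misrecognized or out-of-vocabulary utterance; this reading, and the function names `rejection_rates` and `pick_threshold`, are assumptions. Both classes of utterance are assumed to be present in the data.

```python
from typing import List, Tuple


def rejection_rates(
    scores: List[float], is_correct: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (false alarm rate, correct reject rate) at a given threshold."""
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    wrong = [s for s, ok in zip(scores, is_correct) if not ok]
    # False alarm: a correct recognition falls below the threshold.
    false_alarm = sum(s < threshold for s in correct) / len(correct)
    # Correct reject: a misrecognition or OOV falls below the threshold.
    correct_reject = sum(s < threshold for s in wrong) / len(wrong)
    return false_alarm, correct_reject


def pick_threshold(
    scores: List[float], is_correct: List[bool], max_false_alarm: float = 0.05
) -> float:
    """Return the highest observed score usable as T within the false alarm limit."""
    best = min(scores)  # rejects nothing, so its false alarm rate is zero
    for t in sorted(set(scores)):
        false_alarm, _ = rejection_rates(scores, is_correct, t)
        if false_alarm <= max_false_alarm:
            best = t  # later candidates are higher, so they reject more errors
    return best
```

The threshold would ordinarily be selected on training or cross-validation data and then evaluated unchanged on the test data.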

[0018] These vectors are also known as evidence vectors. Values developed by the vector generator 12 are described below. The definitions below comprise instructions to those skilled in the art to produce the proper coding in accordance with known principles in specific languages. One such suitable language is C++. The following are derived:

[0019] Mean_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number.

[0020] Mean_log_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number.

[0021] Minimum_log_nscore: The minimum normalized phoneme log score from all phonemes in the first choice word, divided by frame number.

[0022] Dev_log_nscore: The standard deviation of phoneme normalized scores of all phonemes in the first choice word, divided by frame number.
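One possible reading of the phoneme level definitions above is sketched below, in which each phoneme log score is normalized by dividing by the number of frames the phoneme spans, and the mean, minimum and standard deviation are then taken over the phonemes of the first choice word. This interpretation, the input format of (log score, frame count) pairs and the function name `phone_features` are assumptions. Python is used for brevity.

```python
import statistics
from typing import Dict, List, Tuple


def phone_features(phones: List[Tuple[float, int]]) -> Dict[str, float]:
    """Compute phoneme level evidence values for the first choice word.

    phones: one (log score, number of frames) pair per phoneme in the word.
    """
    # Normalize each phoneme log score by its duration in frames (assumed reading).
    normalized = [log_score / n_frames for log_score, n_frames in phones]
    return {
        "Mean_log_nscore": statistics.mean(normalized),
        "Minimum_log_nscore": min(normalized),
        # Population standard deviation over the phonemes of the word.
        "Dev_log_nscore": statistics.pstdev(normalized),
    }
```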

[0023] In a further embodiment, the following values are also derived:

[0024] Align_W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses with the same word end time.

[0025] W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses; time alignment is not required, because there is only one keyword.

[0026] Begin_W_Active_States: The number of active states at the beginning frame of the first choice keyword.

[0027] End_W_Active_States: The number of active states at the ending frame of the first choice keyword.

[0028] W_Active_States: The average number of active states across frames of the first choice keyword.

[0029] Phone_duration: The average duration of all phones in the first choice keyword.
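The word level values listed above may be computed along the following lines. The representation of the N-best list as (word, end frame) pairs, the per-frame list of active state counts and the function name `word_features` are assumptions made for illustration; the decoder's actual data structures are not specified herein.

```python
from typing import Dict, List, Tuple


def word_features(
    first_word: str,
    first_end: int,
    nbest: List[Tuple[str, int]],
    active_states: List[int],
    phone_durations: List[int],
) -> Dict[str, float]:
    """Compute word level evidence values for the first choice keyword.

    nbest           : (word, end frame) for each of the N-best hypotheses
    active_states   : active state count for each frame of the keyword
    phone_durations : duration in frames of each phone of the keyword
    """
    n = len(nbest)
    same_word = sum(1 for w, _ in nbest if w == first_word)
    aligned = sum(1 for w, end in nbest if w == first_word and end == first_end)
    return {
        "Align_W_Nbest_Rate": aligned / n,  # same keyword and same end time
        "W_Nbest_Rate": same_word / n,      # same keyword, any end time
        "Begin_W_Active_States": active_states[0],
        "End_W_Active_States": active_states[-1],
        "W_Active_States": sum(active_states) / len(active_states),
        "Phone_duration": sum(phone_durations) / len(phone_durations),
    }
```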

[0030] The first choice word is any word within the vocabulary embodied in the decoder 9 recognized as being the model most closely corresponding to the sequence of phones resolved by the decoder 9.

[0031] Further useful features can also be obtained by appropriate programming of the vector generator 12, namely:

[0032] Align_W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses with the same word end time.

[0033] W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses; time alignment is not required, because there is only one keyword.

[0034] Begin_W_Active_States: The number of active states at the beginning frame of the first choice keyword.

[0035] End_W_Active_States: The number of active states at the ending frame of the first choice keyword.

[0036] W_Active_States: The average number of active states across frames of the first choice keyword.

[0037] A summary of operation is seen in FIG. 2, which illustrates the method and computer product of the present invention. Decoding is performed at block 30. At block 32, vectors are generated by the vector generator 12, and they may also be normalized, as at block 34. The vectors are analyzed, at block 36, by the classifier 15, the output of which is provided to the threshold circuit 17, at block 38, which may comprise a switch to trigger acceptance or rejection at block 40.
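The normalization step is not detailed above. One common choice, shown here purely as an assumed example, is to scale each element of the evidence vector to zero mean and unit variance using statistics gathered from training vectors. The function names `fit_normalizer` and `normalize` are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def fit_normalizer(
    training_vectors: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Return the per-element means and standard deviations of the training vectors."""
    columns = list(zip(*training_vectors))
    means = [statistics.mean(col) for col in columns]
    # A constant element has zero deviation; use 1.0 so the later division is safe.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def normalize(
    vector: Sequence[float], means: Sequence[float], stds: Sequence[float]
) -> List[float]:
    """Scale one evidence vector with previously fitted statistics."""
    return [(v - m) / s for v, m, s in zip(vector, means, stds)]
```

The normalized vector would then be passed to the classifier, whose scalar output is compared against the threshold.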

[0038] What is thus provided is a computationally efficient, robust method and apparatus for rejection of unreliable speech recognition results. Effective results are obtained without having to use computational resources needed for processing based on the incorporation of word recognition models. In accordance with the above teachings, a number of different parameters may be utilized in the generation of a confidence signal while not departing from the present invention. The preferred embodiments as set forth herein are illustrative and not limiting.

What is claimed is:
 1. A system for rejection of unreliable speech recognition results comprising: a speech recognition decoder to provide phone and word level information based upon an audio input; a speech information processor being programmed to produce phone and word parameters; and a neural network classifier to receive evidence vectors and to calculate error signals, said neural network providing error signals comprising confidence information on which a word rejection decision may be based.
 2. The system according to claim 1 further comprising a threshold circuit for receiving the error signals from said neural network classifier and comparing the value thereof to a predetermined level for providing a signal indicative of acceptance or rejection of a word.
 3. The system according to claim 2 further comprising a rejection switch to reject said word in response to an output from said threshold circuit indicative of a determination of rejection.
 4. The system according to claim 1 wherein said decoder embodies speech modeling to utilize 12 MFCC, 12 Δ and 12 ΔΔ values and wherein cepstral mean subtraction, variance normalization and vocal tract normalization are not included in the speech vector.
 5. The system according to claim 4 wherein said speech information processor is configured to compute Mean_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number; Mean_log_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number; Minimum_log_nscore: The minimum normalized phoneme log score from all phonemes in the first choice word, divided by frame number; and Dev_log_nscore: The standard deviation of phoneme normalized scores of all phonemes in the first choice word, divided by frame number.
 6. The system according to claim 5 wherein said speech information processor is also configured to compute: Align_W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses with the same word end time; W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses, time alignment is not required, because there is only one keyword; Begin_W_Active_States: The number of active states at the beginning frame of the first choice keyword; End_W_Active_States: The number of active states at the ending frame of the first choice keyword; W_Active_States: The average number of active states across frames of the first choice keyword; and Phone_duration: The average duration of all phones in the first choice keyword.
 7. A method comprising: decoding phonemes resolved from a speech input measuring the probability p(x) that the decoder has determined a phoneme based on a preselected acoustical model; generating evidence parameters; normalizing said evidence parameters; and calculating error signals based upon said evidence vectors.
 8. A method according to claim 7 wherein computing evidence parameters comprises generating the phoneme parameters: Mean_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number; Mean_log_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number; Minimum_log_nscore: The minimum normalized phoneme log score from all phonemes in the first choice word, divided by frame number; and Dev_log_nscore: The standard deviation of phoneme normalized scores of all phonemes in the first choice word, divided by frame number.
 9. A method according to claim 8 further comprising measuring a probability of correct detection of a word based on an acoustical model, the word comprising a sequence of phonemes; normalizing said probability; and calculating evidence vectors comprising word parameters and utilizing said word parameters in addition to said phoneme parameters to generate error signals.
 10. A method according to claim 9 wherein calculating word parameters comprises calculating Align_W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses with the same word end time; W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses, time alignment is not required, because there is only one keyword; Begin_W_Active_States: The number of active states at the beginning frame of the first choice keyword; End_W_Active_States: The number of active states at the ending frame of the first choice keyword; W_Active_States: The average number of active states across frames of the first choice keyword; and Phone_duration: The average duration of all phones in the first choice keyword.
 11. A method according to claim 10 further comprising taking said error signals and rejecting or not rejecting input indicative of a word in accordance with a level of an error signal associated with the resolved word.
 12. A computer program product comprising: a computer usable medium having computer readable program code embodied in said medium to utilize audio inputs to produce a probability indicative of measurement of a phoneme; computer readable program code to cause the computer to produce error vectors based on calculated evidence parameters calculated from said probability signal; and computer readable program code to cause said computer to provide an error signal value on which acceptance or rejection of a measured phoneme is based.
 13. A computer program product according to claim 12 wherein said computer readable program code comprises means for calculating Mean_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number; Mean_log_nscore: The mean of normalized log scores of all phonemes in the first choice word, divided by frame number; Minimum_log_nscore: The minimum normalized phoneme log score from all phonemes in the first choice word, divided by frame number; and Dev_log_nscore: The standard deviation of phoneme normalized scores of all phonemes in the first choice word, divided by frame number.
 14. A computer program product according to said claim 13 further comprising computer readable program code to cause a computer to produce a probability of a measured word according to an acoustic model based on a sequence of phonemes and wherein said computer readable program code means to determine phoneme parameters further comprises computer readable code for determining word parameters.
 15. A computer program product according to said claim 13 wherein said computer readable program code for generating word parameters comprises means to generate the following values: Align_W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses with the same word end time; W_Nbest_Rate: The rate at which the first choice keyword appears in the N-best hypotheses, time alignment is not required, because there is only one keyword; Begin_W_Active_States: The number of active states at the beginning frame of the first choice keyword; End_W_Active_States: The number of active states at the ending frame of the first choice keyword; W_Active_States: The average number of active states across frames of the first choice keyword; and Phone_duration: The average duration of all phones in the first choice keyword.
 16. A computer program product according to claim 15 wherein the computer program code medium is embodied in a circuit providing vectors to a neural network classifier.